2020/02/09

## virtualenv: pycas2020

Good Way to Learn Python:

But, Who Actually Reads These A to Z?

(spoiler: not me)

me and my programming books


What We Really Need to Know:

  • the tools available in BeautifulSoup and requests
  • what to look for in html code
  • parsing json objects with json
  • rudimentary pandas skills
  • <pro_tip> All you need to know about html is how tags work </pro_tip>

What to Look for in a Scraping Project:

Structured data with a regular repeatable format.

Identical formatting is not required, but the more edge cases present, the more complicated the scraping will be.

Ethics in Scraping

Accessing vast troves of information can be intoxicating.

Just because we can doesn’t mean we should
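One concrete step in that direction is honoring a site's robots.txt. Python's standard library can parse those rules; here is a minimal sketch using made-up rules, parsed directly so no real site is contacted:

```python
from urllib.robotparser import RobotFileParser

# Made-up robots.txt rules, fed straight to the parser.
# Against a real site you would call rp.set_url(".../robots.txt") and rp.read().
rp = RobotFileParser()
rp.parse([
    "User-agent: *",
    "Disallow: /private/",
])

print(rp.can_fetch("*", "https://example.com/locations/"))  # True
print(rp.can_fetch("*", "https://example.com/private/x"))   # False
```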

Legal Considerations

Dollar Stores are Taking Over the World!

Store in Cascade, Idaho


Goal: Extract addresses for all Family Dollar stores in Idaho.

The Starting Point:

Step 1: Load the Libraries

import requests # for making standard HTTP requests
from bs4 import BeautifulSoup # magical tool for parsing html data
import json # for parsing data
from pandas import DataFrame as df # data organization

Step 2: Grab Some Data from Target Web Address

page = requests.get("https://locations.familydollar.com/id/")
soup = BeautifulSoup(page.text, 'html.parser') 

Beautiful Soup will take html or xml content and transform it into a complex tree of objects. Here are several common types:

  • BeautifulSoup - the soup (the parsed content)
  • Tag - main type of bs4 element you will encounter
  • NavigableString - string within a tag
  • Comment - special type of NavigableString
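These types can be seen on a tiny inline document (a sketch; the HTML string here is made up, so no request is needed):

```python
from bs4 import BeautifulSoup, Tag, NavigableString

# A miniature version of the city-index markup we will scrape
html = '<div class="itemlist"><a href="/id/arco/">Arco</a></div>'
soup = BeautifulSoup(html, 'html.parser')

print(type(soup))                    # <class 'bs4.BeautifulSoup'>
print(type(soup.find('div')))        # <class 'bs4.element.Tag'>
print(type(soup.find('a').string))   # <class 'bs4.element.NavigableString'>
```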

Step 3: Determine How to Extract Relevant Content from bs4 Soup

This can be frustrating.

Step 3: Finding Content…

  • Start with one representative example and then scale up
  • Viewing the page’s html source code is essential
    • Run at your own risk:
print(soup.prettify())
  • It is usually easiest to browse via “View Page Source”:

Step 3: Finding Content by Searching

Searching for href does not work, because find_all() given a bare string matches tag names, not attributes — and there is no href tag.

dollar_tree_list = soup.find_all('href')
dollar_tree_list
## []

But searching on a specific class is often successful:

dollar_tree_list = soup.find_all(class_ = 'itemlist')
for i in dollar_tree_list[:2]:
  print(i)
## <div class="itemlist"><a dta-linktrack="City index page - Aberdeen" href="https://locations.familydollar.com/id/aberdeen/">Aberdeen</a></div>
## <div class="itemlist"><a dta-linktrack="City index page - American Falls" href="https://locations.familydollar.com/id/american-falls/">American Falls</a></div>

Step 3: Finding Content by Using ‘contents’

What kind of content do we have and how much is there?

type(dollar_tree_list)
## <class 'bs4.element.ResultSet'>
len(dollar_tree_list)
## 48

Now that we have drilled down to a BeautifulSoup “ResultSet”, we can try extracting the contents.

example = dollar_tree_list[2] # Arco, ID (single representative example)
example_content = example.contents
print(example_content)
## [<a dta-linktrack="City index page - Arco" href="https://locations.familydollar.com/id/arco/">Arco</a>]

Step 3: Finding Content in Attributes

Find out what attributes are present in the contents:

Note: contents usually returns a list of exactly one item, so the first step is to index that item.

example_content = example.contents[0]
example_content.attrs
## {'dta-linktrack': 'City index page - Arco', 'href': 'https://locations.familydollar.com/id/arco/'}

Extract the relevant attribute:

example_href = example_content['href']
print(example_href)
## https://locations.familydollar.com/id/arco/

Step 4: Extract the Relevant Content

city_hrefs = [] # initialise empty list

for i in dollar_tree_list:
    cont = i.contents[0]
    href = cont['href']
    city_hrefs.append(href)

#  check to be sure all went well
for i in city_hrefs[:2]:
  print(i)
## https://locations.familydollar.com/id/aberdeen/
## https://locations.familydollar.com/id/american-falls/

We now have a list of URLs for Family Dollar stores in Idaho to scrape.

Repeat Steps 1-4 for the City URLs

page2 = requests.get(city_hrefs[2]) # representative example
soup2 = BeautifulSoup(page2.text, 'html.parser')

Extract Address Information

from type="application/ld+json"

arco = soup2.find_all(type="application/ld+json")
print(arco[1])
## <script type="application/ld+json">
##  {
##    "@context":"https://schema.org",
##    "@type":"Schema Business Type",
##    "name": "Family Dollar #9143",
##    "address":{
##      "@type":"PostalAddress",
##      "streetAddress":"157 W Grand Avenue",
##      "addressLocality":"Arco",
##      "addressRegion":"ID",
##      "postalCode":"83213",
##      "addressCountry":"US"
##    },
##    "containedIn":"",  
##    "branchOf": {
##      "name":"Family Dollar",
##      "url": "https://www.familydollar.com/"
##    },
##    "url":"https://locations.familydollar.com/id/arco/29143/",
##    "telephone":"208-881-5738",
##    "image": "//hosted.where2getit.com/familydollarstore/images/storefront.png"
##  }           
##  </script>

(address information is in the second list member)

Use ‘contents’ to Find Address Information

Extract the contents (from the second list item) and index the first (and only) list item:

arco_contents = arco[1].contents[0]
arco_contents
## '\n\t{\n\t  "@context":"https://schema.org",\n\t  "@type":"Schema Business Type",\n\t  "name": "Family Dollar #9143",\n\t  "address":{\n\t    "@type":"PostalAddress",\n\t    "streetAddress":"157 W Grand Avenue",\n\t    "addressLocality":"Arco",\n\t    "addressRegion":"ID",\n\t    "postalCode":"83213",\n\t    "addressCountry":"US"\n\t  },\n\t  "containedIn":"",  \n\t  "branchOf": {\n\t    "name":"Family Dollar",\n\t    "url": "https://www.familydollar.com/"\n\t  },\n\t  "url":"https://locations.familydollar.com/id/arco/29143/",\n\t  "telephone":"208-881-5738",\n\t  "image": "//hosted.where2getit.com/familydollarstore/images/storefront.png"\n\t}\t\t\t\n\t'

Next, convert to a json object:
(these are way easier to work with)

arco_json =  json.loads(arco_contents)

Extract Content from a json Object

This is actually a dictionary:

type(arco_json)
## <class 'dict'>
print(arco_json)
## {'@context': 'https://schema.org', '@type': 'Schema Business Type', 'name':
## 'Family Dollar #9143', 'address': {'@type': 'PostalAddress', 'streetAddress': '157 W
## Grand Avenue', 'addressLocality': 'Arco', 'addressRegion': 'ID', 'postalCode':
## '83213', 'addressCountry': 'US'}, 'containedIn': '', 'branchOf': {'name': 'Family
## Dollar', 'url': 'https://www.familydollar.com/'}, 'url':
## 'https://locations.familydollar.com/id/arco/29143/', 'telephone': '208-881-5738',
## 'image': '//hosted.where2getit.com/familydollarstore/images/storefront.png'}

Extract Content from a json Object

arco_address = arco_json['address']
arco_address
## {'@type': 'PostalAddress', 'streetAddress': '157 W Grand Avenue',
## 'addressLocality': 'Arco', 'addressRegion': 'ID', 'postalCode':
## '83213', 'addressCountry': 'US'}

Step 5: Put It All Together

Iterate over the list of store URLs in Idaho:

locs_dict = [] # initialise empty list

for link in city_hrefs:
  locpage = requests.get(link)   # request page info
  locsoup = BeautifulSoup(locpage.text, 'html.parser') 
      # parse the page's content
  locinfo = locsoup.find_all(type="application/ld+json") 
      # extract specific element
  loccont = locinfo[1].contents[0]  
      # get contents from the bs4 element set
  locjson = json.loads(loccont)  # convert to json
  locaddr = locjson['address'] # get address
  locs_dict.append(locaddr) # add address to list
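The per-page steps inside that loop can also be wrapped in a small helper that returns None instead of raising when a page lacks the expected ld+json block — a sketch, with a made-up sample page standing in for a real request:

```python
import json
from bs4 import BeautifulSoup

def extract_address(page_text):
    """Return the 'address' dict from a store page, or None if it is missing."""
    soup = BeautifulSoup(page_text, 'html.parser')
    for block in soup.find_all(type="application/ld+json"):
        try:
            data = json.loads(block.contents[0])
        except (ValueError, IndexError):
            continue  # empty or malformed block - try the next one
        if isinstance(data, dict) and 'address' in data:
            return data['address']
    return None

# Made-up sample page so the helper can be exercised without a request
sample = ('<script type="application/ld+json">'
          '{"address": {"streetAddress": "157 W Grand Avenue", '
          '"addressLocality": "Arco"}}</script>')
print(extract_address(sample))
```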

Step 6: Finalise Data

locs_df = df.from_records(locs_dict)
locs_df.drop(['@type', 'addressCountry'], axis = 1, inplace = True)
locs_df.head(n = 5)
##         streetAddress addressLocality addressRegion postalCode
## 0   111 N Main Street        Aberdeen            ID      83210
## 1     253 Harrison St  American Falls            ID      83211
## 2  157 W Grand Avenue            Arco            ID      83213
## 3     177 Main Street          Ashton            ID      83420
## 4     747 N. Main St.        Bellevue            ID      83313

Results!!

locs_df.to_csv("family_dollar_ID_locations.csv", sep = ",", index = False)

A Few Words on Selenium

“Inspect Element” provides the code for what we actually see in a browser.

A Few Words on Selenium

“View Page Source” provides the code for what requests will obtain. But there is javascript modifying the source code, so the source code needs to be accessed after the page has loaded in a browser.

A Few Words on Selenium

  • Requires a webdriver to retrieve the content
  • It actually opens a web browser, and this info is collected
  • Selenium is powerful - it can interact with loaded content in many ways
  • After getting data, continue to use BeautifulSoup as before
from selenium import webdriver

url = "https://www.walgreens.com/storelistings/storesbycity.jsp?requestType=locator&state=ID"
driver = webdriver.Firefox(executable_path = 'mypath/geckodriver.exe')
driver.get(url)
soup_ID = BeautifulSoup(driver.page_source, 'html.parser')
store_link_soup = soup_ID.find_all(class_ = 'col-xl-4 col-lg-4 col-md-4')

The Penultimate Slide

~ After We Become Web Scraping Masters ~

Bonus Slide!

Dollar Stores in America
